{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在[快速开始](https://github.com/DataCanvasIO/HyperTS/blob/main/examples/zh_CN/02_quick_start.ipynb)中, 我们了解到HyperTS的基本建模方式:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hyperts.experiment import make_experiment\n",
    "from hyperts.datasets import load_network_traffic\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "df = load_network_traffic()\n",
    "train_data, test_data = train_test_split(df, test_size=168, shuffle=False)\n",
    "\n",
    "experiment = make_experiment(train_data,\n",
    "                            task='forecast',\n",
    "                            timestamp='TimeStamp',\n",
    "                            covariables=['HourSin', 'WeekCos', 'CBWD'])\n",
    "model = experiment.run()\n",
    "\n",
    "X_test, y_test = model.split_X_y(test_data)\n",
    "forecast = model.predict(X_test)\n",
    "scores = model.evaluate(y_true=y_test, y_pred=forecast)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了您更好的掌握HyperTS的使用技巧, 本NoteBook将展开详细地讲解```make_experiment```, 以期您可以发掘出它更加鲁棒的性能表现。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 以缺省配置创建一个实验\n",
    "2. 选择运行模式\n",
    "3. 指定模型的评估指标\n",
    "4. 指定优化方向\n",
    "5. 设置最大搜索次数\n",
    "6. 设置早停策略\n",
    "7. 指定验证数据集\n",
    "8. 指定搜索算法\n",
    "9. 指定时间频率\n",
    "10. 指定预测窗口\n",
    "11. 固定随机种子\n",
    "12. 调整日志级别\n",
    "13. 不连续序列预测\n",
    "14. 无时间列时序预测\n",
    "15. 预测数据截断\n",
    "16. 交叉验证\n",
    "17. 模型融合"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. 以缺省配置创建一个实验(default)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先，必须告诉实验我们将做一个什么类型的任务，即给参数```task```赋值。\n",
    "其次，在预测任务中，我们必须向```make_experiment```传入参数```timestamp```的列名。如果存在协变量，也需要传入```covariables```的列名。在分类任务中，数据的目标列如果不是y或者target的话，需要通过参数```target```的设置。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#预测\n",
    "experiment = make_experiment(train_data, \n",
    "                             task='forecast',\n",
    "                             timestamp='TimeStamp',\n",
    "                             covariables=['HourSin', 'WeekCos', 'CBWD'])\n",
    "#分类\n",
    "experiment = make_experiment(train_data, task='classification', target='y')  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据集相关信息请参考[这里](https://github.com/DataCanvasIO/HyperTS/blob/main/hyperts/datasets/base.py)。\n",
    "\n",
    "\n",
    "**注意事项**\n",
    "\n",
    "对于时序预测任务，按照预测变量的数量可能划分为单变量预测和多变量预测。对于时序分类任务，按照特征变量的数量及类别的数据可划分为单变量二分类，单变量多分类，多变量二分类及多变量多分类。如果我们在拿到数据后已经清楚数据及所解决任务的基本情况，建议在配置```task```传入以下参数：\n",
    "\n",
    "- 单变量预测: task='unvariate-forecast';\n",
    "- 多变量预测：task='multivariate-forecast';\n",
    "- 单变量二分类: task='univariate-binaryclass';\n",
    "- 单变量多分类: task='univariate-multiclass';\n",
    "- 多变量二分类: task='multivariate-binaryclass';\n",
    "- 多变量多分类: task='multivariate-multiclass'.\n",
    "  \n",
    "当然，也可以简单配置 task='forecast', task='classification'及task='regression'，这样HyperTS将从数据中结合其他已知列信息进行详细的任务类型推断。\n",
    "\n",
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. 选择运行模式(mode)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "HyperTS内置了三种运行模式，分别为统计模型模式('stats')，深度学习模式('dl')以及神经架构搜索模式('nas', 未开放)。缺省情况下，默认选择统计模型模式，您也可以更改为其他模式："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             mode='dl',\n",
    "                             ...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "深度学习模式基于Tensorflow可支持GPU，缺省情况下，默认实验将在CPU环境下运行。如果您的设备支持GPU并安装了gpu版本的tensorflow-gpu，可以更改参数```tf_gpu_usage_strategy```："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             mode='dl',\n",
    "                             tf_gpu_usage_strategy=1,\n",
    "                             ...)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "其中，```tf_gpu_usage_strategy```支持三种配置策略，分别为：\n",
    "- 0: CPU下运行;\n",
    "- 1: GPU内存容量依据模型规模及运行情况增长;\n",
    "- 2: GPU内存容量限制最大容量，默认为2048M，参数```tf_memory_limit```支持自定义配置。\n",
    "\n",
    "-------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. 指定模型的评估指标(reward_metric)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当使用```make_experiment```创建实验时，缺省情况下，预测任务默认的模型评估指标是'mae'，分类任务是'accuracy'，回归任务默认是'rmse'。您可以通过参数```reward_metric```重新指定评估指标, 可以是'str'也可以是```sklearn.metrics```内置函数，示例如下:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# str\n",
    "experiment = make_experiment(train_data, \n",
    "                             task='univariate-binaryclass',\n",
    "                             reward_metric='auc',\n",
    "                             ...)  \n",
    "\n",
    "# sklearn.metrics\n",
    "from sklearn.metrics import auc\n",
    "experiment = make_experiment(train_data, \n",
    "                             task='univariate-binaryclass',\n",
    "                             reward_metric=auc,\n",
    "                             ...)   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "目前，```reward_metric```可以支持多种评估指标，具体如下：\n",
    "  - 分类: accuracy, auc, f1, precision, recall, logloss。\n",
    "  - 预测及回归: mae, mse, rmse, mape, smape, msle, r2。\n",
    "\n",
    "-----------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. 指定优化方向(optimize_direction)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在模型搜索阶段，需要给搜索者指定搜索方向，在缺省情况下，默认将从```reward_metric```中检测。您也可以通过参数```optimize_direction```进行指定('min'或者'max')："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             task='univariate-binaryclass',\n",
    "                             reward_metric='auc',\n",
    "                             optimize_direction='max',\n",
    "                             ...)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5. 设置最大搜索次数(max_trials)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "缺省情况下，```make_experiment```所创建的实验搜索3种参数模型便停止搜索。实际使用中，建议将最大搜索次数设置为30以上,时间充裕的话，更大的搜索次数将有更高的机率获得更加优秀的模型:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             max_trials=100,\n",
    "                             ...) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6. 设置早停策略(early_stopping)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当```max_trials```设置比较大时，可能需要更多的时间等待实验运行完毕。为了把控工作的节奏，您可以通过```make_experiment```的早停机制(Early Stopping)进行控制："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             max_trials=100,\n",
    "                             early_stopping_time_limit=3600 * 3,  # 将搜索时间设置为最多3个小时\n",
    "                             ...)    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "其中，```make_experiment```共包含了三种早停机制，可以搭配使用。分别为：\n",
    "- early_stopping_time_limit: 限制实验的运行时间，粒度为秒。\n",
    "- early_stopping_round: 限制实验的搜索轮数，粒度为次。\n",
    "- early_stopping_reward: 指定一个奖励得分的界限。\n",
    "\n",
    "-------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7. 指定验证数据集(eval_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "模型训练除了需要训练数据集，还需要评估数据集，缺省情况下将从训练数据集中以一定比例切分一部分评估数据集。您也可在```make_experiment```时通过eval_data指定评估集，如："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             eval_data=eval_data,\n",
    "                             ...) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然，您也可以通过设置```eval_size```自己指定评估数据集的大小:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             eval_size=0.3,\n",
    "                             ...) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 8. 指定搜索算法(searcher)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "HyperTS通过[Hypernets](https://github.com/DataCanvasIO/Hypernets)中内置的搜索算法进行模型选择和超参数优化，其中包括EvolutionSearcher(缺省, 'evolution')、MCTSSearcher('mcts')、以及RandomSearch('random')等。在使用```make_experiment```时，可通过参数```searcher```指定，指定搜索算法的类名(class)或者搜索算法的名称(str):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             searcher='random',\n",
    "                             ...)   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "各种搜索算法详细介绍可参考[搜索算法](https://hypernets.readthedocs.io/en/latest/searchers.html)。\n",
    "\n",
    "<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 9. 指定时间频率(freq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在时序预测任务中，如果我们已知数据集的时间频率，您可以通过参数```freq```来精确化指定:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             task='forecast',\n",
    "                             timestamp='TimeStamp',\n",
    "                             freq='H',\n",
    "                             ...) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "缺省情况下，频率将依据```timestamp```进行推断。\n",
    "\n",
    "----------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10. 指定预测窗口(forecast_window)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当使用深度学习模式进行时序预测时，您可以结合经验对数据的实际情况分析后，通过参数```forecast_window```指定滑动窗口的大小:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             task='forecast',\n",
    "                             mode='dl',\n",
    "                             timestamp='TimeStamp',\n",
    "                             forecast_window=24*7,\n",
    "                             ...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 11. 固定随机种子(random_state)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有时为了保证实验结果可以复现，我们需要保持相同的初始化，此时，您可以通过参数```random_state```固定随机种子:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             random_state=0,\n",
    "                             ...)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 12. 调整日志级别(log_level)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果希望在训练过程中看到使用进度信息的话，可通过log_level指定日志级别。关于日志级别的详细定义可参考python的logging包。 另外，如果将verbose设置为1的话，可以得到更详细的信息。例如，将日志级别设置为'INFO'："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data, \n",
    "                             log_level='INFO', \n",
    "                             verbose=1,\n",
    "                             ...)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 13. 不连续序列预测(freq-null)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在某些时序预测任务中, 可能没有规律性的时间频率, 即非连续采样。此时, 您可以通过设置参数 ``freq='null'`` 及 ``mode='dl'`` 来告知 ``experiment`` 数据的这个属性:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data,\n",
    "                            task='forecast',\n",
    "                            timestamp='TimeStamp',\n",
    "                            freq='null',\n",
    "                            ...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "此时, HyperTS将调用深度学习模式(DL only)来针对该数据进行时序预测。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 14.  无时间列时序预测(timestamp-null)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在某些时序预测数据中, 可能没有存储时间列 ``timestamp``, 即只包含目标列以及协变量列等特征。此时, 您可以通过设置参数 ``timestamp='null'`` 来告知 ``experiment`` 数据的这个属性以解耦时间列:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data,\n",
    "                            task='forecast',\n",
    "                            timestamp='null',\n",
    "                            ...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "此外, 如果已知数据的采样频率, 建议通过参数 ``freq`` 指定, 这样将有助于数据的预处理。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 15. 预测数据截断(forecast_train_data_periods)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于某些存在很长历史数据的时序预测任务, 使用全部数据建模, 历史数据可能不符合未来数据的序列特性而且也会增加模型的训练成本。此时, 您可以通过参数 ``forecast_train_data_periods`` 来从训练数据末端向前截取一定周期的数据进行训练:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data,\n",
    "                            task='forecast',\n",
    "                            mode='stats',\n",
    "                            timestamp='TimeStamp',\n",
    "                            forecast_train_data_periods=24*10,\n",
    "                            ...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 16. 交叉验证(cross validation)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了增强模型的鲁棒性, 可通过参数 ``cv`` 指定是否启用交叉验证。当 ``cv`` 设置为 ``True`` 时表示开启交叉验证, 折数可通过参数 ``num_folds`` 设置(默认: 3)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data,\n",
    "                            cv==True,\n",
    "                            num_folds=5,\n",
    "                            ...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 17. 模型融合(ensemble_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了获取较好的模型效果, ``make_experiment`` 创建实验时可以开启模型融合的特性, 即通过参数 ``ensemble_size`` 指定参与融合的最优模型的数量。当 ``ensemble_size`` 设置为 ``None`` 时则表示禁用模型融合(默认)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = make_experiment(train_data,\n",
    "                            ensemble_size==10,\n",
    "                            max_trials=100,\n",
    "                            ...)"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "29dd11bf16d40fe19950dcc2f06dd773fb6bc4491ac296fb211bfed7a4a532da"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}